Introduction to Open Data Science - Course Project

Chapter 1: About the project

Some thoughts about the course

I am feeling good about the course at the moment. In the beginning, I encountered some errors connecting RStudio and GitHub. I was a bit frustrated at the time, but after I managed to solve them, I felt encouraged.

I have used RStudio a couple of times before, but without R Markdown. In this course, I expect to learn R Markdown and GitHub. Both are powerful tools, and I believe I will use them in my research in the future.

I found this course on Sisu and I was attracted by its name, ‘Open Data Science’.

About the book R for Health Data Science

  • Chapter 1 gives all the basics one needs to know before learning R.
  • Chapters 2 to 4 introduce basic commands (functions) for working with data in R, including summarizing and visualizing it.

I find this book clear and well organized. I would recommend it to everyone who is starting to learn R.

End of Chapter 1


Chapter 2: Regression and model validation

This work covers performing and interpreting regression analysis: plotting the data frame, fitting a linear regression model, interpreting the results, and producing diagnostic plots.

2.1 Read the data

The data come from the ‘International Survey of Approaches to Learning’, made possible by Teachers’ Academy funding for Kimmo Vehkalahti in 2013-2015. After some data wrangling, the final dataset I am working with contains 166 observations and 7 variables: gender, age, attitude, deep, stra, surf and points.

date()
## [1] "Tue Nov 29 02:30:26 2022"
#library
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ purrr   0.3.5
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# read the data into memory
lrn14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)
#check the dimension and structure of the data
dim(lrn14)
## [1] 166   7
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

2.2 Graphical overview of the data

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(lrn14[-1])

# create a more advanced plot matrix with ggpairs(), adding transparency with alpha
p2 <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p2

#summary of the data 
summary(lrn14)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Part 2.2 gives a graphical overview of the data. The distributions of the variables appear on the diagonal of the plot matrix, and the relationships among the variables on the off-diagonal panels. The console output summarizes each variable.

2.3 Fitting a regression model

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ deep + stra + surf, data = lrn14)
#summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

From the summary of the model, I find that stra has a positive association with points (0.99), whereas the surf variable has a negative one (-1.63), although neither reaches the 0.05 significance level. The effect of deep on points is not significant, because its p-value is bigger than 0.1 (the p-value is the result of a null-hypothesis significance test of the true parameter being equal to zero). Therefore, I remove the deep variable, include attitude, and rerun the model.
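As a sanity check of the reasoning above, the t statistic and p-value in the summary can be reproduced by hand from the estimate and its standard error. A minimal sketch using the deep row of the summary (the numbers are copied from the output, not recomputed from the data):

```r
# recompute the t statistic and two-sided p-value for the deep coefficient,
# using the estimate, standard error and residual degrees of freedom above
est <- -0.7443
se  <- 0.8662
df  <- 162
t_val <- est / se                                          # about -0.859
p_val <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)   # about 0.39
```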

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~  attitude + stra + surf, data = lrn14)
#summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the new model, I find that attitude has a significant positive effect on points. However, the effect of surf is not significant, so I remove surf and rerun the model.

# create a regression model with multiple explanatory variables
my_model3 <- lm(points ~  attitude + stra , data = lrn14)
#summary of the model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

2.4 Interpreting the results of the model

From the summary of the model above, the estimated coefficient of attitude is 3.4658. The interpretation is that, holding the other variables unchanged, a one-unit increase in attitude is associated with a 3.47-unit increase in points. This positive effect is significant, as the p-value of the coefficient is very low. By the same interpretation, the estimated coefficient of stra is 0.91: holding the other variables unchanged, a one-unit increase in stra is associated with a 0.91-unit increase in points. However, this effect is significant only at the 0.1 level, as its p-value is smaller than 0.1 but greater than 0.05. The adjusted R-squared is 0.195, meaning that about 19.5% of the variability in the dependent variable is explained by the explanatory variables.
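To make the ‘one-unit increase’ interpretation concrete, here is a small sketch using the coefficients copied from the summary above (the helper function pred() is my own illustration, not part of the analysis):

```r
# coefficients copied from summary(my_model3) above
b <- c(intercept = 8.9729, attitude = 3.4658, stra = 0.9137)

# hypothetical helper: predicted points for given attitude and stra
pred <- function(attitude, stra) {
  unname(b["intercept"] + b["attitude"] * attitude + b["stra"] * stra)
}

# moving attitude from 3 to 4 (stra held at 3) changes the prediction
# by exactly the attitude coefficient
pred(4, 3) - pred(3, 3)   # 3.4658
```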

2.5 Diagnostic plots

# draw diagnostic plots 1 (Residuals vs Fitted), 2 (Normal Q-Q) and 5 (Residuals vs Leverage)
par(mfrow = c(2,2))
plot(my_model3, which = c(1, 2, 5))

Linear regression models have four main assumptions:

  • Linear relationship between predictors and outcome;
  • Independence of residuals;
  • Normal distribution of residuals;
  • Equal variance of residuals.

From the diagnostic plots, I can assess these assumptions through the residuals. The Q-Q plot shows that the residuals match a normal distribution quite well. The Residuals vs Fitted plot shows the residuals scattered randomly around zero, with roughly equal spread across the range of fitted values. In this way, the diagnostic plots support the validity of the model assumptions.
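The visual checks can be complemented with numeric ones. A self-contained sketch on simulated data whose assumptions hold by construction (using shapiro.test() for residual normality is my addition; the analysis above relies on the plots):

```r
# toy linear model whose assumptions hold by construction
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

# numeric normality check of the residuals
# (a large p-value means no evidence against normality)
shapiro.test(residuals(fit))$p.value

# for OLS, fitted values and residuals are uncorrelated by construction
cor(fitted(fit), residuals(fit))
```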

End of Chapter 2


Chapter 3: Logistic regression

This work covers performing and interpreting logistic regression analysis: plotting the data frame, fitting a logistic regression model, and interpreting the results.

3.1 Read the data

This data describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The two datasets were modeled under binary/five-level classification and regression tasks.

date()
## [1] "Tue Nov 29 02:30:32 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)

3.2 Choose the variables of interest

The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. In this exercise, I choose the famrel, studytime, age and freetime variables and study their relationship with alcohol consumption. I hypothesize that famrel and studytime have a negative impact on alcohol consumption, while age and freetime have a positive impact on it.

3.3 Graphical interpretation of the variables

This subchapter uses bar plots to explore the distributions of my chosen variables, and box plots to explore their relationships with alcohol consumption.

#distribution plots using bar plot
# initialize a plot of alcohol use
g_1 <- ggplot(data = alc, aes(x = alc_use))
g_1 + geom_bar()

g_2 <- ggplot(data = alc, aes(x = high_use))
g_2 + geom_bar()

g_3 <- ggplot(data = alc, aes(x = famrel))
g_3 + geom_bar()

g_4 <- ggplot(data = alc, aes(x = studytime))
g_4 + geom_bar()

g_5 <- ggplot(data = alc, aes(x = age))
g_5 + geom_bar()

g_6 <- ggplot(data = alc, aes(x = freetime))
g_6 + geom_bar()

# initialize a plot of high_use and famrel
g1 <- ggplot(alc, aes(x = high_use, y = famrel))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationship")

# studytime and alchohol consumption
g2 <- ggplot(alc, aes(x = high_use, y = studytime))
g2 + geom_boxplot() + ylab("study time")

# age and alchohol consumption
g3 <- ggplot(alc, aes(x = high_use, y = age))
g3 + geom_boxplot() + ylab("age")

# freetime and alchohol consumption
g4 <- ggplot(alc, aes(x = high_use, y = freetime))
g4 + geom_boxplot() + ylab("freetime")

Part 3.3 gives a graphical overview of the variables. The distributions can be read from the bar plots: alcohol use, high_use, studytime and age are concentrated at low values with a tail to the right (right-skewed), while famrel and freetime are concentrated at high values (left-skewed). From the box plots, it seems that famrel and studytime are negatively correlated with alcohol use, whereas age and freetime seem to have little effect on it.
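Skewness can also be checked numerically instead of eyeballing the bar plots. A minimal sketch with a hand-rolled moment-based helper (the skewness() function is my own; packages such as moments provide equivalent ones):

```r
# simple moment-based sample skewness:
# positive = long right tail, negative = long left tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(c(1, 1, 1, 2, 5))   # positive: mass at low values, tail to the right
skewness(c(1, 4, 5, 5, 5))   # negative: mass at high values, tail to the left
```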

3.4 Logistic regression models

This part runs the logistic regression with the variables selected above.

# find the model with glm()
m <- glm(high_use ~ famrel + studytime + age + freetime, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + studytime + age + freetime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7542  -0.8553  -0.6349   1.1722   2.2035  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.2910     1.8099  -1.818 0.069019 .  
## famrel       -0.3542     0.1315  -2.693 0.007086 ** 
## studytime    -0.5725     0.1576  -3.632 0.000281 ***
## age           0.2221     0.1023   2.170 0.029983 *  
## freetime      0.3793     0.1264   3.000 0.002701 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 415.07  on 365  degrees of freedom
## AIC: 425.07
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR       2.5 %      97.5 %
## (Intercept) 0.03721707 -6.88776678  0.23054128
## famrel      0.70173583 -0.61460703 -0.09722191
## studytime   0.56414042 -0.89131071 -0.27196009
## age         1.24866753  0.02362678  0.42612657
## freetime    1.46128242  0.13517480  0.63202911

From the table, I observe that the estimated coefficients of famrel and studytime are -0.35 and -0.57 respectively, and both are statistically significant judging from their p-values. The results indicate that famrel and studytime have a negative relationship with alcohol consumption, which supports my earlier hypothesis. On the other hand, the estimated coefficients of age and freetime are positive (0.22 and 0.38 respectively). The p-value of the age coefficient is 0.03, so it is significant at the 0.05 level but not at the 0.01 level. In addition, the table gives the odds ratios, i.e. the exponentials of the estimated coefficients, together with confidence intervals; note that confint() returns the intervals on the log-odds scale, so they would need to be exponentiated to match the odds ratios. The exponent of a logistic regression coefficient can be interpreted as the odds ratio between a unit change (vs. no change) in the corresponding explanatory variable.
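The odds-ratio interpretation can be checked directly from the numbers above. A small sketch (the estimate and interval limits are copied from the output):

```r
# famrel: a one-unit increase changes the log-odds by the coefficient,
# so the odds are multiplied by exp(coefficient)
b_famrel <- -0.3542
exp(b_famrel)                  # about 0.70, matching the OR column above

# confint() works on the log-odds scale; exponentiating its limits
# gives a confidence interval for the odds ratio itself
exp(c(-0.6146, -0.0972))
```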

3.5 Prediction

##         prediction
## high_use FALSE TRUE
##    FALSE   238   21
##    TRUE     92   19

This part shows the predictive power of the model via a 2x2 cross tabulation. The overall performance looks acceptable, but one type of error is severe: when high_use is TRUE, the model predicts FALSE in most cases (92 out of 111). The training error is 0.305, which is quite high; in fact it is no lower than the 0.300 obtained by simply guessing FALSE for everyone, so a better model may be needed.
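The training error quoted above can be recomputed from the cross tabulation alone. A sketch (the counts are copied from the table):

```r
# confusion matrix from the output above (rows: true high_use, cols: prediction)
tab <- matrix(c(238, 92, 21, 19), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

# off-diagonal cells are the wrong predictions
wrong <- tab["TRUE", "FALSE"] + tab["FALSE", "TRUE"]   # 92 + 21
wrong / sum(tab)                                       # about 0.305
```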

3.6 Cross-validation

In this part, cross-validation of my model is performed. Note that the cv.glm() call below uses K = nrow(alc), which is leave-one-out rather than 10-fold cross-validation.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# proportion of wrong predictions in the (training) data when always guessing FALSE (prob = 0)
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3189189

The average proportion of wrong predictions in the cross-validation is 0.32, which is larger than 0.26. Unfortunately, my model does not have better test-set performance than the model in Exercise 3.
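For reference, the K argument of cv.glm() is what switches between K-fold and leave-one-out cross-validation. A self-contained sketch on simulated data (the toy data frame d is my own illustration, not the course data):

```r
library(boot)

# toy logistic-regression data
set.seed(7)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = "binomial")

# same loss function as above (mean prediction error)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

err_10fold <- cv.glm(d, fit, cost = loss_func, K = 10)$delta[1]        # 10-fold
err_loocv  <- cv.glm(d, fit, cost = loss_func, K = nrow(d))$delta[1]   # leave-one-out
```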


Chapter 4: Clustering and classification

This work covers performing and interpreting clustering and classification: plotting the data frame, fitting linear discriminant analysis, prediction, and k-means clustering.

4.1 Read the data

Load the Boston data from the MASS package. The dataset is titled ‘Housing Values in Suburbs of Boston’ and contains 506 rows and 14 columns.

date()
## [1] "Tue Nov 29 02:30:34 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

4.2 Graphical interpretation of the variables

This subchapter uses plot matrices and a correlation plot to explore the distributions of the Boston variables and the relationships between them.

# draw a scatter plot matrix of the first seven variables in Boston
pairs(Boston[1:7])

# draw the plot

p2 <- ggpairs(Boston[1:7], mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2 

p3 <- ggpairs(Boston[8:14], mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p3

# calculate the correlation matrix
cor_matrix <- cor(Boston) 

# print the correlation matrix
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

Part 4.2 gives a graphical overview of the variables. The distributions of the variables can be seen in the second and third plot matrices. The last plot is a correlation plot, which clearly shows the correlations among the variables.

4.3 Some modifications to the dataset

This part makes some modifications to the data. Specifically, I scale the data, create a categorical variable of the crime rate (from the scaled crime rate) using the quantiles as break points, drop the old crime rate variable from the dataset, and divide the dataset into train and test sets so that 80% of the data belongs to the train set.

library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high", "high"), include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

4.4 Linear discriminant analysis

In this part, a linear discriminant analysis is fitted on the train dataset. The categorical crime rate is used as the target variable and all the other variables in the dataset as predictors. This part also draws the LDA (bi)plot.

# linear discriminant analysis
lda.fit <- lda(crime~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2648515 0.2475248 0.2475248 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.9690353 -0.9670650 -0.06938576 -0.9020017  0.45045817 -0.9198026
## med_low  -0.1035499 -0.2594827  0.02203357 -0.5577483 -0.13781424 -0.2721560
## med_high -0.3826200  0.1680265  0.16075196  0.3899365  0.04929514  0.4102155
## high     -0.4872402  1.0171519 -0.03610305  1.0821320 -0.43718601  0.8139974
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8808487 -0.6846906 -0.7487723 -0.50612811  0.38420640 -0.77649828
## med_low   0.3752996 -0.5514644 -0.4719765 -0.04464551  0.32107104 -0.10566476
## med_high -0.3517064 -0.4501310 -0.3275667 -0.25429443  0.05153048  0.05898798
## high     -0.8470894  1.6377820  1.5138081  0.78037363 -0.68808834  0.85991932
##                 medv
## low       0.55756881
## med_low  -0.03110516
## med_high  0.15365638
## high     -0.63420032
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.13461247  0.693828231 -0.89133726
## indus    0.06291320 -0.340801254  0.51940810
## chas    -0.02683906 -0.008339373  0.11669596
## nox      0.40936586 -0.735360393 -1.45900213
## rm       0.03419348 -0.049192509 -0.10404152
## age      0.13766362 -0.374038701  0.04461133
## dis     -0.11171454 -0.372872078  0.35015045
## rad      3.65100851  0.916725301  0.06961680
## tax      0.01915599  0.105616942  0.45666665
## ptratio  0.16766850 -0.079239708 -0.31534634
## black   -0.06659298  0.038061443  0.13633532
## lstat    0.25916453 -0.263970355  0.21041033
## medv     0.14353928 -0.424161867 -0.33004949
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9565 0.0323 0.0112
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 4,col = classes)
lda.arrows(lda.fit, myscale = 3)

4.5 Prediction using LDA

In this chapter, I predict the crime classes with the test data.


# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      10        4    0
##   med_low    5      11        3    0
##   med_high   0       8       15    3
##   high       0       0        0   27

The cross tabulation indicates fairly good predictions from the LDA model. The high class is predicted almost perfectly (27 of 27), while the other classes are more often confused with their neighbours; for example, 10 of the 30 low observations are predicted as med_low, and 8 of the 26 med_high observations are predicted as med_low.
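The overall accuracy can be computed directly from the cross tabulation above (the counts are copied from the output):

```r
classes <- c("low", "med_low", "med_high", "high")

# rows: correct class, columns: predicted class (filled column by column)
tab <- matrix(c(16,  5,  0,  0,
                10, 11,  8,  0,
                 4,  3, 15,  0,
                 0,  0,  3, 27),
              nrow = 4, dimnames = list(correct = classes, predicted = classes))

# diagonal cells are the correct predictions
sum(diag(tab)) / sum(tab)   # (16 + 11 + 15 + 27) / 102
```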

4.6 Distance and k-means clustering

This part calculates the distances between the observations, runs the k-means algorithm on the dataset, and investigates the optimal number of clusters.

# scale the data
Boston <- as.data.frame(scale(Boston))
# drop the crim column so the distances are computed on the same variables as boston_scaled
Boston$crim <- NULL
# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.5760  4.9401  4.9913  6.3033 12.8856
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2645  8.9648 13.2765 14.1297 18.5263 46.5452
set.seed(13)
# k-means clustering
km <- kmeans(Boston, centers = 4)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

The optimal number of clusters is 2. This is determined by observing that the total WCSS drops sharply when moving from one cluster to two. In the pairs plot, I notice that the tax variable seems to affect the clustering results the most.
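The ‘sharp drop’ reasoning can be sketched on toy data with two well-separated clusters (the toy matrix is my own illustration, not the Boston data):

```r
# two well-separated 2-D clusters
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# total within-cluster sum of squares for k = 1..5
twcss <- sapply(1:5, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# the drop between consecutive k values is by far the largest at k = 1 -> 2,
# which is the elbow indicating two clusters
drops <- -diff(twcss)
which.max(drops)   # 1, i.e. the step from one to two clusters
```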

4.7 Super bonus

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color = ~train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color= ~km$cluster[1:404])

(more chapters to be added similarly as we proceed with the course!)